{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "H_D9kG_efts3"
   },
   "source": [
    "# Transformers 模型量化技术：AWQ（OPT-2.7B）"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "WE9IhcVyktah"
   },
   "source": [
    "![img](https://huggingface.co/datasets/ybelkada/documentation-images/resolve/main/Thumbnail.png)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "Wwsg6nCwoThm"
   },
   "source": [
    "在2023年6月，Ji Lin等人发表了论文 [AWQ：Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/pdf/2306.00978.pdf)。\n",
    "\n",
    "这篇论文详细介绍了一种激活感知权重量化算法，可以用于压缩任何基于 Transformer 的语言模型，同时只有微小的性能下降。关于 AWQ 算法的详细介绍，见[MIT Han Song 教授分享](https://hanlab.mit.edu/projects/awq)。\n",
    "\n",
    "transformers 现在支持两个不同的 AWQ 开源实现库：\n",
    "\n",
    "- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ)\n",
    "- [LLM-AWQ](https://github.com/mit-han-lab/llm-awq) \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "-H2019RkoiM-"
   },
   "source": [
    "因为 LLM-AWQ 不支持 Nvidia T4 GPU（课程演示 GPU），所以我们使用 AutoAWQ 库来介绍和演示 AWQ 模型量化技术。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "6dJJRQ2p7eLQ"
   },
   "source": [
    "## 使用 AutoAWQ 量化模型\n",
    "\n",
    "下面我们以 `facebook opt-2.7B` 模型为例，使用 `AutoAWQ` 库实现的 AWQ 算法实现模型量化。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "from awq import AutoAWQForCausalLM\n",
    "from transformers import AutoTokenizer\n",
    "\n",
    "model_name_or_path = \"facebook/opt-2.7b\"\n",
    "quant_model_dir = \"models/opt-2.7b-awq\"\n",
    "\n",
    "quant_config = {\n",
    "    \"zero_point\": True,\n",
    "    \"q_group_size\": 128,\n",
    "    \"w_bit\": 4,\n",
    "    \"version\": \"GEMM\"\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "379a0bba9ee74953b5e1facf448da666",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Fetching 8 files:   0%|          | 0/8 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# 加载模型\n",
    "model = AutoAWQForCausalLM.from_pretrained(model_name_or_path, trust_remote_code=True)\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "id": "Qn_P_E5p7gAN"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/root/miniconda3/lib/python3.11/site-packages/huggingface_hub/repocard.py:105: UserWarning: Repo card metadata block was not found. Setting CardData to empty.\n",
      "  warnings.warn(\"Repo card metadata block was not found. Setting CardData to empty.\")\n",
      "AWQ: 100%|██████████| 32/32 [16:38<00:00, 31.21s/it]\n"
     ]
    }
   ],
   "source": [
    "# 量化模型\n",
    "model.quantize(tokenizer, quant_config=quant_config)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 实测 AWQ 量化模型：GPU显存占用峰值超过10GB\n",
    "\n",
    "\n",
    "```shell\n",
    "Sun Dec 24 15:21:35 2023\n",
    "+---------------------------------------------------------------------------------------+\n",
    "| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |\n",
    "|-----------------------------------------+----------------------+----------------------+\n",
    "| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |\n",
    "| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |\n",
    "|                                         |                      |               MIG M. |\n",
    "|=========================================+======================+======================|\n",
    "|   0  Tesla T4                       Off | 00000000:00:0D.0 Off |                    0 |\n",
    "| N/A   53C    P0              71W /  70W |   7261MiB / 15360MiB |     97%      Default |\n",
    "|                                         |                      |                  N/A |\n",
    "+-----------------------------------------+----------------------+----------------------+\n",
    "\n",
    "+---------------------------------------------------------------------------------------+\n",
    "| Processes:                                                                            |\n",
    "|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |\n",
    "|        ID   ID                                                             Usage      |\n",
    "|=======================================================================================|```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "id": "nVzKDBlP_6MV"
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMM'}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "quant_config"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "PuPLq9sa8EaN"
   },
   "source": [
    "#### Transformers 兼容性配置\n",
    "\n",
    "为了使`quant_config` 与 transformers 兼容，我们需要修改配置文件：`使用 Transformers.AwqConfig 来实例化量化模型配置`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "id": "KE8xjwlL8DnA"
   },
   "outputs": [],
   "source": [
    "from transformers import AwqConfig, AutoConfig\n",
    "\n",
    "# 修改配置文件以使其与transformers集成兼容\n",
    "quantization_config = AwqConfig(\n",
    "    bits=quant_config[\"w_bit\"],\n",
    "    group_size=quant_config[\"q_group_size\"],\n",
    "    zero_point=quant_config[\"zero_point\"],\n",
    "    version=quant_config[\"version\"].lower(),\n",
    ").to_dict()\n",
    "\n",
    "# 预训练的transformers模型存储在model属性中，我们需要传递一个字典\n",
    "model.model.config.quantization_config = quantization_config"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('models/opt-2.7b-awq/tokenizer_config.json',\n",
       " 'models/opt-2.7b-awq/special_tokens_map.json',\n",
       " 'models/opt-2.7b-awq/vocab.json',\n",
       " 'models/opt-2.7b-awq/merges.txt',\n",
       " 'models/opt-2.7b-awq/added_tokens.json',\n",
       " 'models/opt-2.7b-awq/tokenizer.json')"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 保存模型权重\n",
    "model.save_quantized(quant_model_dir)\n",
    "# 保存分词器\n",
    "tokenizer.save_pretrained(quant_model_dir)  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "OptAWQForCausalLM(\n",
       "  (model): OPTForCausalLM(\n",
       "    (model): OPTModel(\n",
       "      (decoder): OPTDecoder(\n",
       "        (embed_tokens): Embedding(50272, 2560, padding_idx=1)\n",
       "        (embed_positions): OPTLearnedPositionalEmbedding(2050, 2560)\n",
       "        (final_layer_norm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)\n",
       "        (layers): ModuleList(\n",
       "          (0-31): 32 x OPTDecoderLayer(\n",
       "            (self_attn): OPTAttention(\n",
       "              (k_proj): WQLinear_GEMM(in_features=2560, out_features=2560, bias=True, w_bit=4, group_size=128)\n",
       "              (v_proj): WQLinear_GEMM(in_features=2560, out_features=2560, bias=True, w_bit=4, group_size=128)\n",
       "              (q_proj): WQLinear_GEMM(in_features=2560, out_features=2560, bias=True, w_bit=4, group_size=128)\n",
       "              (out_proj): WQLinear_GEMM(in_features=2560, out_features=2560, bias=True, w_bit=4, group_size=128)\n",
       "            )\n",
       "            (activation_fn): ReLU()\n",
       "            (self_attn_layer_norm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)\n",
       "            (fc1): WQLinear_GEMM(in_features=2560, out_features=10240, bias=True, w_bit=4, group_size=128)\n",
       "            (fc2): WQLinear_GEMM(in_features=10240, out_features=2560, bias=True, w_bit=4, group_size=128)\n",
       "            (final_layer_norm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)\n",
       "          )\n",
       "        )\n",
       "      )\n",
       "    )\n",
       "    (lm_head): Linear(in_features=2560, out_features=50272, bias=False)\n",
       "  )\n",
       ")"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.eval()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 使用 GPU 加载量化模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import AutoTokenizer, AutoModelForCausalLM\n",
    "\n",
    "tokenizer = AutoTokenizer.from_pretrained(quant_model_dir)\n",
    "model = AutoModelForCausalLM.from_pretrained(quant_model_dir, device_map=\"cuda\").to(0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "def generate_text(text):\n",
    "    inputs = tokenizer(text, return_tensors=\"pt\").to(0)\n",
    "\n",
    "    out = model.generate(**inputs, max_new_tokens=64)\n",
    "    return tokenizer.decode(out[0], skip_special_tokens=True)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Merry Christmas! I'm glad to. M-M-M-M-M-M-M-M-M-M-M- M-M-M-M-1-\n",
      "\n",
      "M-M-M-M-M-M-M-M-M-M-M-M-M-M-M\n"
     ]
    }
   ],
   "source": [
    "result = generate_text(\"Merry Christmas! I'm glad to\")\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "id": "Z0hAXYanCDW3"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The woman worked as a the the woman.\n",
      "The the man\n",
      "the the the the the woman\n"
     ]
    }
   ],
   "source": [
    "result = generate_text(\"The woman worked as a\")\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "gpuType": "T4",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
